{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine learning for real?\n", "\n", "So far we have seen several machine learning techniques, in particular:\n", "* decision trees, from which random forests can be built; these are a supervised learning technique used for classification;\n", "* clustering, which groups together the points that are most similar to one another;\n", "* projection into a lower-dimensional space, which lets us visualize our points;\n", "* label propagation, which assigns labels to the whole dataset starting from a small set of correctly labeled points.\n", "\n", "But we have only used these techniques on synthetic datasets, which are very simple, and sometimes perhaps a little too simple.\n", "We will therefore see how these techniques can be used to analyze a real dataset. We will try to:\n", "* represent our data, even though its dimension is too high to plot directly;\n", "* group points together, and see whether we can say something about the resulting groups;\n", "* find the parameters (tree depth, number of trees, etc.) that let us make predictions from our data.\n", "\n", "Of course, these techniques are not the only ones that exist, nor necessarily the most effective ones.\n", "However, they are all based on graph methods, which build a graph representation of our data.\n", "(And in fact, they very often give very good results! 
:) )" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "import seaborn as sns\n", "sns.set()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset\n", "\n", "This time, we will try to learn something from real data.\n", "We have a dataset in which each row represents a music recording, from which a series of measurements was extracted, along with its musical genre.\n", "\n", "Our final goal is to predict this musical genre from the other variables.\n", "\n", "Unlike with the synthetic datasets used so far, we will first try to understand what our data looks like." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading and first observations\n", "\n", "Let's start by loading our data with the `pandas` library." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv(\"dataset.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then look at the distribution of the values of each variable." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
" | file_name | zero_crossing | spectral_centroid | spectral_rolloff | spectral_bandwidth | chroma_frequency | rmse | delta | melspectogram | tempo | ... | mfcc11 | mfcc12 | mfcc13 | mfcc14 | mfcc15 | mfcc16 | mfcc17 | mfcc18 | mfcc19 | label |\n",
"---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"count | 1742 | 1742.000000 | 1742.000000 | 1742.000000 | 1742.000000 | 1742.000000 | 1742.000000 | 1.742000e+03 | 1742.000000 | 1742.000000 | ... | 1742.000000 | 1742.000000 | 1742.000000 | 1742.000000 | 1742.000000 | 1742.000000 | 1742.000000 | 1742.000000 | 1742.000000 | 1742 |\n",
"unique | 1742 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6 |\n",
"top | 01. Aaj Sraboner Amontrone.mp3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | nazrul |\n",
"freq | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 312 |\n",
"mean | NaN | 215007.465557 | 2015.700468 | 4100.426211 | 2043.560009 | 0.305385 | 0.171471 | -1.743660e-09 | 8.522913 | 123.112315 | ... | 0.117813 | -5.085212 | -1.104625 | -4.488481 | -0.222948 | -4.636100 | 0.331630 | -4.160293 | -0.704487 | NaN |\n",
"std | NaN | 89920.930842 | 721.696480 | 1597.279461 | 666.244323 | 0.072464 | 0.075168 | 1.393930e-07 | 7.374733 | 21.849677 | ... | 5.985646 | 5.253280 | 5.643891 | 4.682304 | 4.659930 | 4.515759 | 4.447436 | 4.531011 | 4.433841 | NaN |\n",
"min | NaN | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -1.676001e-06 | 0.000000 | 73.828125 | ... | -25.692529 | -24.642862 | -26.359570 | -23.886369 | -20.935224 | -21.520218 | -17.205040 | -24.743798 | -16.665104 | NaN |\n",
"25% | NaN | 152359.000000 | 1533.507050 | 2962.049702 | 1699.097722 | 0.249591 | 0.118410 | -6.728298e-08 | 3.576854 | 107.666016 | ... | -3.046898 | -8.325668 | -3.995158 | -7.407150 | -2.824771 | -7.600435 | -2.115535 | -6.914308 | -3.256649 | NaN |\n",
"50% | NaN | 196971.500000 | 2033.964745 | 4226.889780 | 2223.573271 | 0.294676 | 0.160642 | 1.617425e-10 | 6.477533 | 123.046875 | ... | 0.643217 | -4.750612 | -0.513353 | -4.537535 | 0.162360 | -4.908912 | 0.495084 | -4.428667 | -0.535849 | NaN |\n",
"75% | NaN | 257056.250000 | 2495.077302 | 5258.493696 | 2532.093962 | 0.351839 | 0.214249 | 6.924068e-08 | 11.528968 | 135.999178 | ... | 3.916396 | -1.735095 | 2.440787 | -1.597814 | 2.756847 | -2.071220 | 3.124587 | -1.830402 | 2.015320 | NaN |\n",
"max | NaN | 757737.000000 | 5323.086970 | 8810.877261 | 3252.209261 | 0.616620 | 0.628826 | 6.417627e-07 | 83.923833 | 184.570312 | ... | 20.077568 | 14.958208 | 16.891013 | 24.020983 | 20.451894 | 19.347293 | 19.915842 | 21.322978 | 20.967689 | NaN |\n",
"\n",
"11 rows × 31 columns
\n", "\n",
" | zero_crossing | spectral_centroid | chroma_frequency | rmse | delta | tempo | mfcc0 | mfcc2 | mfcc3 | mfcc4 | ... | mfcc10 | mfcc11 | mfcc12 | mfcc13 | mfcc14 | mfcc15 | mfcc16 | mfcc17 | mfcc18 | mfcc19 |\n",
"---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"count | 1.742000e+03 | 1.742000e+03 | 1.742000e+03 | 1.742000e+03 | 1.742000e+03 | 1.742000e+03 | 1.742000e+03 | 1.742000e+03 | 1.742000e+03 | 1.742000e+03 | ... | 1.742000e+03 | 1.742000e+03 | 1.742000e+03 | 1.742000e+03 | 1.742000e+03 | 1.742000e+03 | 1.742000e+03 | 1.742000e+03 | 1.742000e+03 | 1.742000e+03 |\n",
"mean | 2.826862e-16 | -1.948626e-17 | -1.574197e-16 | 8.030316e-17 | -1.657049e-18 | -7.277314e-16 | 2.516803e-16 | 1.325958e-16 | 7.459908e-17 | 1.218569e-16 | ... | -4.336410e-17 | 9.005426e-17 | 2.137594e-16 | -2.813160e-16 | 2.272070e-17 | 1.666609e-16 | 9.630006e-17 | -5.697700e-17 | -1.165511e-16 | -9.572646e-17 |\n",
"std | 1.000287e+00 | 1.000287e+00 | 1.000287e+00 | 1.000287e+00 | 1.000287e+00 | 1.000287e+00 | 1.000287e+00 | 1.000287e+00 | 1.000287e+00 | 1.000287e+00 | ... | 1.000287e+00 | 1.000287e+00 | 1.000287e+00 | 1.000287e+00 | 1.000287e+00 | 1.000287e+00 | 1.000287e+00 | 1.000287e+00 | 1.000287e+00 | 1.000287e+00 |\n",
"min | -2.391759e+00 | -2.793805e+00 | -4.215485e+00 | -2.281819e+00 | -1.201451e+01 | -2.256250e+00 | -1.159683e+01 | -4.554401e+00 | -4.679039e+00 | -4.716782e+00 | ... | -4.554122e+00 | -4.313277e+00 | -3.724010e+00 | -4.476025e+00 | -4.143998e+00 | -4.446038e+00 | -3.740006e+00 | -3.944229e+00 | -4.544111e+00 | -3.600761e+00 |\n",
"25% | -6.969062e-01 | -6.683307e-01 | -7.701652e-01 | -7.060944e-01 | -4.703116e-01 | -7.071379e-01 | -6.171237e-01 | -6.065826e-01 | -3.884310e-01 | -5.633696e-01 | ... | -5.776729e-01 | -5.288686e-01 | -6.170215e-01 | -5.122998e-01 | -6.235195e-01 | -5.584997e-01 | -6.566306e-01 | -5.503997e-01 | -6.079892e-01 | -5.757751e-01 |\n",
"50% | -2.006334e-01 | 2.531469e-02 | -1.478245e-01 | -1.441053e-01 | 1.367321e-02 | -2.995879e-03 | 1.308499e-01 | 1.638622e-01 | 1.296066e-02 | 9.521434e-02 | ... | 6.462766e-02 | 8.780251e-02 | 6.371191e-02 | 1.047931e-01 | -1.047961e-02 | 8.270927e-02 | -6.043073e-02 | 3.676296e-02 | -5.924763e-02 | 3.804522e-02 |\n",
"75% | 4.677538e-01 | 6.644268e-01 | 6.412391e-01 | 5.692530e-01 | 5.093852e-01 | 5.899658e-01 | 7.070439e-01 | 7.036774e-01 | 5.201561e-01 | 6.215948e-01 | ... | 6.500392e-01 | 6.347975e-01 | 6.379022e-01 | 6.283661e-01 | 6.175373e-01 | 6.396343e-01 | 5.681474e-01 | 6.281729e-01 | 5.143577e-01 | 6.135962e-01 |\n",
"max | 6.037364e+00 | 4.584110e+00 | 4.296243e+00 | 6.086163e+00 | 4.617816e+00 | 2.813572e+00 | 2.832922e+00 | 3.292609e+00 | 4.427734e+00 | 3.548240e+00 | ... | 3.347441e+00 | 3.335561e+00 | 3.816506e+00 | 3.189432e+00 | 6.090517e+00 | 4.438002e+00 | 5.312568e+00 | 4.404748e+00 | 5.625806e+00 | 4.889305e+00 |\n",
"\n",
"8 rows × 25 columns
\n", "